arXiv:1504.00064v1 [stat.ML] 31 Mar 2015


Crowdsourcing Feature Discovery via 
Adaptively Chosen Comparisons 


James Y. Zou (Microsoft Research, Cambridge, MA)
Kamalika Chaudhuri (University of California, San Diego, San Diego, CA)
Adam Tauman Kalai (Microsoft Research, Cambridge, MA)


Abstract 

We introduce an unsupervised approach to efficiently discover the
underlying features in a data set via crowdsourcing. Our queries ask
crowd members to articulate a feature common to two out of three
displayed examples. In addition we also ask the crowd to provide binary
labels to the remaining examples based on the discovered features. The
triples are chosen adaptively based on the labels of the previously
discovered features on the data set. In two natural models of features,
hierarchical and independent, we show that a simple adaptive algorithm,
using “two-out-of-three” similarity queries, recovers all features with
less labor than any non-adaptive algorithm. Experimental results
validate the theoretical findings.


1 Introduction 

Discovering features is essential to the success of machine learning
and statistics. Crowdsourcing can be used to discover these underlying
features, in addition to merely labeling them on data at hand. This
paper addresses the following unsupervised learning problem: given a
data set, using as few crowd queries as possible, elicit a diverse set
of salient feature names along with


[Figure 1 content: the 2/3 query “Which two are similar and why?”
elicits the feature “one handed”; tags given to the individual examples
include signal, motion, man, gesture, goatee, balding, beard, hand
movement, and sign language.]


Figure 1: Comparing three examples yields a 
useful feature whereas tagging them separately 
yields nondiscriminative features. 

their labels on that data set. For example, on a data set of faces,
salient features might correspond to gender, the presence of glasses,
facial expression, and numerous others. In this paper we focus on
binary features, each of which can be thought of as a function mapping
data to {0,1}. The term feature name refers to a string describing the
feature (e.g., male or wearing glasses), and the label of a feature on
an example refers to the {0,1}-value of that feature on that datum, as
annotated by crowd workers. Features are useful in exploratory
analysis, for other machine learning tasks, and for browsing data by
filtering on various facets. While the features we use are
human-generated and human-labeled, they could be combined with features
from machine learning, text analysis, or computer vision algorithms. In
some cases, features provide a significantly more compact
representation than other unsupervised representations such as
clustering, e.g., one would need exponentially many clusters (such as
smiling white men with grey hair wearing glasses) to represent a small
number of features.

A widely-used crowdsourcing technique for eliciting features is to
simply ask people to tag data with multiple words or phrases. However,
tagging individual examples fails to capture the differences between
multiple images in a data set. To illustrate this problem, we asked 10
crowd workers to tag 10 random signs from an online dictionary of
American Sign Language [1], all depicted by the same bearded man in a
gray sweatshirt. As illustrated in Figure 1, the tags generally refer
to his hair, clothes, or the general fact that he is gesturing with his
hands. Each of the 33 tags could apply equally well to any of the 10
video snips, so none of the features could discriminate between the
signs.

Inspired by prior work [?] and the familiar kindergarten question,
“which one does not belong?”, we elicit feature names by presenting a
crowd worker with a triple of examples and asking them to name a
feature common to any two out of the three examples. We refer to this
as a “two-out-of-three” or, more succinctly, 2/3 query. These features
are meant to differentiate yet be common, as opposed to overly specific
features that capture peculiarities rather than meaningful
distinctions. As shown in Figure 1, in contrast to tagging, the learned
features partition the data meaningfully.

How should one choose such triples? We find that, very often, random
triples redundantly elicit the same set of salient features. For
example, 60% of the responses on random sign triples distinguish signs
that use one vs. two hands. To see why, suppose that there are two
“obvious” complementary features, e.g., male and female, which split
the data into two equal-sized partitions and are more salient than any
other, i.e., people most often notice these features first. If the data
are balanced, then 75% of triples can be resolved by one of these two
features.
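The 75% figure can be checked by direct enumeration. The snippet below is a minimal sketch under an idealizing assumption that is ours, not the paper's: each example in a triple is independently male or female with probability 1/2, which approximates a large balanced data set. The function name `resolved_by_gender` is illustrative.

```python
from itertools import product

# Assumption: each of the three examples is independently "male" (1) or
# "female" (0) with probability 1/2, so all 8 patterns are equally likely.
patterns = list(product([0, 1], repeat=3))


def resolved_by_gender(triple):
    """True if 'male' or 'female' holds for exactly two of the three examples."""
    males = sum(triple)
    females = 3 - males
    return males == 2 or females == 2


fraction_resolved = sum(resolved_by_gender(t) for t in patterns) / len(patterns)
print(fraction_resolved)  # 0.75
```

Only the two all-male and all-female patterns escape both features, which is why the remaining 6 of 8 patterns are resolved.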

To address this inefficiency, once we’ve discovered a feature, e.g.,
one/two-handedness, we then ask crowd workers to label the remaining
data according to this feature. This labeling is necessary eventually,
since we require the data to be annotated according to all discovered
features. Once we have labels for the data, we never perform a 2/3
query on resolved triples, i.e., those for which we have a feature
whose labels are positive on two out of the three examples. Random 2/3
queries often result in one of these salient features. Our adaptive
algorithm, on the other hand, after learning the features of, say,
“male” and “female,” always presents three faces labeled by the same
gender (assuming consistent labeling) and thereby avoids eliciting the
same feature again (or functionally equivalent features such as “a
man”).

The face data set also illustrates how some features are hierarchical
while others are orthogonal. For instance, the feature “bearded”
generally applies only to men, while the feature “smiling” is common
across genders. We analyze our algorithm and show that it can yield
large savings both in the case of hierarchical and orthogonal features.
Proposition 4.1 states that our algorithm finds all M features of a
proper binary hierarchical “feature tree” using M queries, whereas
Proposition 4.2 states that any non-adaptive algorithm requires
$\Omega(M^2)$ queries. The lower bound also suggests that “generalist”
query responses are more challenging than “specifics,” e.g., in
comparing a goat, a kangaroo, and a car, the generalist may say that
the goat and kangaroo are both animals, while the specifist may
distinguish them as both mammals. We then present a more sophisticated
algorithm that recovers D-ary trees on M features and N examples using
$\tilde{O}(N + MD^2)$ queries, with high probability (see Proposition
4.3).

Finally, we show that in the case of M independent random features,
adaptivity can give an exponential improvement provided that there is
sufficient data (Lemmas 5.2 and 5.3). For example, in the case of M
independent uniformly random features, our algorithm finds all features
using fewer than 3M queries (in expectation) compared to
$\Omega(1.6^M)$ for a random triple algorithm. In all analyses, we do
not include the cost of labeling the features on the data since this
cost must be incurred regardless of which approach is used for feature
elicitation. Moreover, the labeling cost is modest as workers took less
than one second, amortized, to label a feature per image when batched
(prior work [10] reported batch labeling for approximately $0.001 per
image-feature label).


Interestingly, our theoretical findings imply that 2/3 queries are
sufficient to learn in both our models of hierarchical and independent
features, with sufficient data. We also discuss 2/3 queries in
comparison to other types, e.g., why not ask a “1/3 query” for a
feature that distinguishes one example from two others? Note that 1/3
and 2/3 queries may seem mathematically equivalent if the negation of a
feature is allowed (one could point out that two are “not wearing green
scarves”). However, research in psychology does not find this to be the
case for human responses, where similarity is assumed to be based on
the most common positive features that examples share (see, e.g.,
Tversky’s theory of similarities [?]). Proposition 6.1 shows that there
are data sets where arbitrarily large query sizes are necessary to
elicit certain features.


The paper is organized as follows. After discussing related work in
Section 2, we present preliminaries, the adaptive triple algorithm, and
the baseline (non-adaptive) random triple algorithm in Section 3. In
Section 4, we define the hierarchical model and bound the number of
queries of these algorithms in the case of hierarchical features. The
performance under independent features is analyzed in Section 5.
Section 6 considers alternative types of queries. Experimental results
are presented in Section 7.


2 Related work 

In machine learning and AI applications [9], relevant features are
often elicited from domain experts [?] or from text mining [3]. As
mentioned, a common approach for crowdsourcing named features is image
tagging; see, e.g., the ESP game [?]. There is much work on automatic
representation learning and feature selection from the data alone (see,
e.g., [?]), but these literatures are too large to summarize here.

One work that inspired our project was that of Patterson and Hays [?],
who crowdsourced nameable attributes for the SUN Database of images
using comparative queries. They presented workers with random
quadruples of images from a data set separated vertically and elicited
features by asking what distinguishes the left pair from the right.
Their images were chosen randomly and hence without adaptation. They
repeated this task over 6,000 times. We discuss such left-right queries
in Section 6.

For supervised learning, Parikh and Grauman [9] address multi-class
classification by identifying features that are both nameable and
machine-approximable. They introduce a novel computer vision algorithm
to predict “nameability” of various directions in high-dimensional
space and present users with images ordered by that direction. Like
ours, their algorithm adapts over time, though their underlying problem
and approach are quite different. In independent work on crowdsourcing
binary classification, Cheng and Bernstein [?] elicit features by
showing workers a random pair consisting of a positive and a negative
example. They cluster the features using statistical text analysis,
which reduces redundant labeling of similar features (which our
algorithm does through adaptation), but it does not solve the problem
that a large number of random comparisons are required in order to
elicit fine-grained features. They also introduce techniques to improve
the feature terminology and clarify feature definitions, which could be
incorporated into our work as well.

Finally, crowdsourced feature discovery is a human-in-the-loop form of
unsupervised dictionary learning (see, e.g., [?]). Analogous to the
various non-featural representations of data, crowdsourcing other
representations has also been studied. For hierarchical clustering, a
number of algorithms have been proposed (see, e.g., Chilton et al.
[?]). Kernel-based similarity representations have also been
crowdsourced adaptively [?].

3 Preliminaries and Definitions 

We first assume that there is a given set $X = \{x_1, x_2, \ldots,
x_N\}$ of examples (images, pieces of text or music, etc.) and an
unknown set $F = \{f_1, f_2, \ldots, f_M\}$ of binary features $f_j : X
\to \{0,1\}$ to be discovered. We say that feature $f_j$ is present in
an example $x \in X$ if $f_j(x) = 1$, absent if $f_j(x) = 0$, and we
abuse notation and write $x_{if} = f(x_i)$ and $x_{ij} = f_j(x_i)$.
Hence, since there are M hidden features and N examples, there is an
underlying latent N-by-M feature allocation matrix A with binary
entries. The ith row of A corresponds to sample $x_i$, and the jth
column of A corresponds to feature $f_j$.

Our goal is to recover this entire matrix A, together with names for
the features, using minimal human effort.

Definition Given a feature $f$ and an example $x_i$, a labeling query
$L(x_i, f)$ returns $f(x_i)$.

As we will discuss, in practice labeling is performed more efficiently
in batches. A consideration for query design is that we want each
contrastive query to be as cognitively simple as possible for the human
worker. Our analysis suggests that comparisons of size three suffice,
but for completeness we define comparisons on pairs as well.

Definition A 2/3 query $Q(x, y, z)$ either returns a feature $f \in F$
such that $f(x) + f(y) + f(z) = 2$ or it returns NONE if no such
feature exists.



Figure 2: A sample proper binary feature tree. When comparing the pen,
flower, and tree, the distinguishing features are natural and plant. A
generalist would respond with natural.


A 1/2 query $Q(x, y)$ either returns a feature $f \in F$ such that
$f(x) + f(y) = 1$ or returns NONE if x and y are identical.


We also refer to 2/3 queries as triple queries and 1/2 queries as pair
queries. Note that we can simulate a pair query $Q(x, y)$ by two triple
queries $Q(x, x, y)$ and $Q(x, y, y)$. We say that a feature $f$
distinguishes a set of examples S if $\sum_{x \in S} f(x) = |S| - 1$,
i.e., if it holds for all but one example in S.
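The query definitions above can be made concrete with a small simulated oracle. This is only a sketch under stated assumptions: the feature table `FEATURES` is a hypothetical toy data set loosely modeled on the examples named in Figure 2, and the oracle answers with the first distinguishing feature in dictionary order, which is one possible answering behavior rather than a model of real crowd workers.

```python
# Hypothetical toy data: feature name -> set of examples on which it is 1.
# Dictionary order stands in for salience (an assumption for illustration).
FEATURES = {
    "natural": {"flower", "tree", "fish"},
    "plant": {"flower", "tree"},
}


def query_2_of_3(x, y, z, features=FEATURES):
    """2/3 query: return a feature f with f(x) + f(y) + f(z) == 2, else None."""
    for name, members in features.items():
        if (x in members) + (y in members) + (z in members) == 2:
            return name
    return None


def pair_query(x, y, features=FEATURES):
    """1/2 query simulated by the two triple queries Q(x, x, y) and Q(x, y, y)."""
    answer = query_2_of_3(x, x, y, features)
    if answer is None:
        answer = query_2_of_3(x, y, y, features)
    return answer


print(query_2_of_3("pen", "flower", "tree"))  # natural
print(pair_query("flower", "fish"))           # plant
print(pair_query("flower", "tree"))           # None
```

The repeated argument in `Q(x, x, y)` counts twice in the sum, so the query succeeds exactly when a feature is 1 on x and 0 on y; `Q(x, y, y)` covers the opposite case.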


Definition A query is resolved if there is a 
known distinguishing feature for the query, or 
it is known that NONE is the outcome of the 
query. 

Algorithm 1, the Adaptive Triple algorithm, is the main algorithm we
use for experimentation and analysis (though we also analyze a more
advanced algorithm).



4 Hierarchical Feature Models 

We now consider the setting where the features and examples form a
tree, with each internal node (other than the root) corresponding to a
single feature and each leaf corresponding to a single example. The
features that are 1 for an example are defined to be those on the path
to the root, and the others are 0. The root is not considered a
feature. Hence, if feature $f$ is an ancestor of $g$, then $g \le f$ in
that whenever $g$ is 1, $f$ must be 1 as well.

Algorithm 1 Adaptive Triple
Input: Examples $X = \{x_i\}$.

Output: A set of features $F = \{f\}$ and their corresponding labels on
all examples, $x_{if}$ for $i \le N$, $f \in F$.

1: Randomly select a triple $\{x, y, z\}$ from the set of all
unresolved triple queries. Let $f = Q(x, y, z)$.

2: If $f \ne$ NONE: (a) add it to F, (b) run the labeling query
$L(x_i, f)$ for all $x_i \in X$, and (c) update the set of unresolved
queries.

3: If all triples of examples can be resolved by one of the discovered
features, terminate and output F and the labels. Otherwise, go to 1.
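A minimal sketch of Algorithm 1 with a simulated crowd may help fix ideas. Everything named here is an assumption made for illustration: the callbacks `oracle` and `label` stand in for crowd responses, the five-leaf tree with features `P`, `Q`, `R` is a hypothetical proper binary feature tree, and triples are re-enumerated on every round for clarity rather than efficiency.

```python
import random
from itertools import combinations


def adaptive_triple(examples, oracle, label, seed=0):
    """Sketch of Algorithm 1. `oracle(x, y, z)` answers a 2/3 query with a
    feature name or None; `label(example, feature)` answers a labeling query."""
    rng = random.Random(seed)
    labels = {}      # feature name -> set of examples labeled 1
    asked = set()    # triples already queried (this covers NONE responses)
    num_queries = 0

    def resolved(triple):
        if triple in asked:
            return True
        return any(sum(e in members for e in triple) == 2
                   for members in labels.values())

    while True:
        unresolved = [t for t in combinations(examples, 3) if not resolved(t)]
        if not unresolved:
            return labels, num_queries
        triple = rng.choice(unresolved)          # Step 1
        asked.add(triple)
        num_queries += 1
        feature = oracle(*triple)
        if feature is not None:                  # Step 2: label all examples
            labels[feature] = {e for e in examples if label(e, feature)}


# Hypothetical proper binary feature tree on leaves a..e with M = 3 features.
TREE_FEATURES = {"P": {"a", "b", "c"}, "Q": {"d", "e"}, "R": {"a", "b"}}


def oracle(x, y, z):
    for name, members in TREE_FEATURES.items():
        if (x in members) + (y in members) + (z in members) == 2:
            return name
    return None


def label(example, feature):
    return example in TREE_FEATURES[feature]


labels, num_queries = adaptive_triple(list("abcde"), oracle, label)
print(sorted(labels), num_queries)  # ['P', 'Q', 'R'] 3
```

Marking every queried triple as asked is a small safeguard added in this sketch: with consistent labels it coincides with the definition of a resolved query, and with noisy answers it prevents the same triple from being queried forever.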


Definition A feature tree T is a rooted tree in which each internal
node (aside from the root) corresponds to a distinct feature and each
leaf corresponds to a distinct example. The value of a feature on an
example is 1 if the node corresponding to that feature is on the path
to the root from the leaf corresponding to the example, and 0
otherwise.

Note that our algorithms recover the features but not the tree
explicitly; reconstructing the corresponding feature tree is
straightforward if the data are consistent with one.

4.1 Binary feature trees 

In this section, we consider the standard notion of proper binary trees
in which each internal node has exactly two children. Figure 2
illustrates a proper binary feature tree.

Proposition 4.1. For a proper binary feature 
tree on M features, the Adaptive Triple algo¬ 
rithm finds all features in M queries. 

Proof. To prove this proposition, we will show that: (a) we never
receive a NONE response in the Adaptive Triple algorithm, and (b) every
feature has at least one triple for which it is the unique
distinguishing feature. Since a query in this algorithm cannot return
an already discovered feature, and since there are M features, this
implies that there must be exactly M queries.

For (a), let $f$ be the least common ancestor of an example triple
$\{x, y, z\}$. Since T is proper, $f$ must have exactly two children.
By the definition of least common ancestor, two out of $\{x, y, z\}$
must be beneath one child of $f$ (call this child $g$) while the other
one is beneath the other child. Then $g$ is a distinguishing feature
for $Q(x, y, z)$. Hence, we should never receive a NONE response.

For (b), observe that every internal node (other than the root) has at
least one triple for which it is the unique distinguishing feature. In
particular, given any internal node $f$, let $l$ and $r$ be its left
and right children. Let $x$ and $y$ be examples under $l$ and $r$
(with possibly $x = l$ or $y = r$ if $l$ or $r$ are leaves). Let $s$ be
the sibling of $f$ (the other child of its parent) and let $z$ be any
leaf under $s$ (again $z = s$ if $s$ is a leaf). Then it is clear that
$f$ is the unique distinguishing feature for the triple $x$, $y$, and
$z$. For example, in Figure 2, for the feature plant, a triple such as
the flower, tree, and fish would uniquely be distinguished by plant. □

Now consider different ways to answer queries: define a generalist as
an oracle for Q that responds to any query with the shallowest
distinguishing feature, i.e., the one closest to the root. For example,
given the pen, flower and tree of Figure 2, the generalist would point
out that the flower and tree are both natural rather than that they are
both plants. Also, say an algorithm is non-adaptive if it specifies its
queries in advance, i.e., the triples cannot depend on the answers to
previous queries but could be random. We also assume that the data is
anonymous, which means that we can think of the specific examples being
randomly permuted in secret before being given to the algorithm.

We now show that any general-purpose non-adaptive algorithm that does
not exploit any content information on the examples requires at least
$\Omega(M^2)$ queries to find all M features, and at least
$\Omega(M^3)$ if all queries are answered by generalists.

Proposition 4.2. If the examples correspond to a random permutation of
the leaves of a proper binary tree T with M features, then any
non-adaptive algorithm requires at least $M^2/12$ queries to recover
all M features with probability 1/2. Furthermore, if queries are
answered by generalists, then any non-adaptive algorithm requires at
least $M^3/24$ queries to find all features with probability 1/2.

Figure 2 sheds light on this proposition: in order to discover the
feature bird, we must choose both birds in a triple. If the queries are
answered by a generalist, we would have to choose the birds and fish.
The probability of choosing two specific examples is $O(1/M^2)$ while
the probability of choosing three specific examples is $O(1/M^3)$.


Proof. Let $f$ be the deepest feature (or one of them if there is more
than one). Let $f$ have children $x$ and $y$, which must be leaves
since $f$ is a deepest internal node. Let $s$ be the sibling of $f$.
Now, in order to discover $f$, the triple must consist of $x$ and $y$
and another node, which happens with probability $(N-2)/\binom{N}{3} =
\frac{6}{N(N-1)} \le 6/M^2$ for a random triple (since $N = M + 2$). By
the union bound, if there are only $M^2/12$ triples, it will fail to
discover $f$ with probability at least 1/2.

Now consider a generalist answering queries. Let S be the set of leaves
under $s$. Since $f$ is the deepest feature, S must be a set of size 1
or 2 depending on whether or not $s$ is a leaf. It is not difficult to
see that the only triples that return $f$ (for a generalist) are $x$,
$y$ and an element of S. Hence there are at most 2 triples that recover
$f$. Since there are $\binom{N}{3} \ge M^3/6$ triples, if there are
fewer than $M^3/24$ triples, then the probability that any one of them
is equal to one of the two target triples is at most 1/2. The union
bound completes the proof. □
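The counting steps in this proof can be verified with exact rational arithmetic. The following is a sketch that only rechecks the identities and inequalities stated above; the function names are ours, and the relation N = M + 2 is the one used in the proof for proper binary feature trees.

```python
from fractions import Fraction
from math import comb


def pair_in_triple_probability(M):
    """Probability that a uniformly random triple contains two fixed leaves."""
    N = M + 2
    p = Fraction(N - 2, comb(N, 3))
    assert p == Fraction(6, N * (N - 1))   # closed form used in the proof
    assert p <= Fraction(6, M * M)         # since N(N-1) >= M^2
    assert comb(N, 3) * 6 >= M ** 3        # number of triples is >= M^3 / 6
    return p


def union_bounds(M):
    """Union-bound success probabilities at the two query budgets."""
    specific = Fraction(M * M, 12) * Fraction(6, M * M)        # M^2/12 queries
    generalist = Fraction(M ** 3, 24) * Fraction(12, M ** 3)   # M^3/24 queries
    return specific, generalist


print(pair_in_triple_probability(10))  # 1/22
print(union_bounds(10))                # (Fraction(1, 2), Fraction(1, 2))
```

Both budgets are chosen so that the union bound evaluates to exactly 1/2, which is where the constants 12 and 24 come from.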


Note that pairs are insufficient to recover internal nodes in the case
where a specifist answers queries. This motivates the need for triples;
moreover, Proposition 4.1 shows that triple queries suffice to discover
all the features in a binary feature tree.


4.2 General feature trees 

We now present a theoretical algorithm using triple queries which
allows us to efficiently learn general “D-ary leafy feature trees,”
which we define to be a feature tree in which: (a) every internal node
(i.e., feature) has at most D internal nodes (but arbitrarily many
leaves) as children, and (b) no internal node has a single child which
is an internal node. Condition (a) is simply a generalization of the
standard branching factor of a rooted tree, and condition (b) rules out
any “redundant” features, i.e., features which take the same value for
each example.

Proposition 4.3 (Adaptive Hybrid, Upper Bound). Let T be a D-ary leafy
feature tree with N examples and M features. The Adaptive Hybrid
algorithm with exploration time $\theta = 3D^2 \log \frac{M}{\delta}$
terminates after $O(N + MD^2 \log \frac{M}{\delta})$ triple queries and
finds all features with probability $\ge 1 - \delta$.

The proof of Proposition 4.3 makes use of the following lemma.

Lemma 4.4. Let T be a non-star, D-ary leafy feature tree. Then the
Random Triple algorithm





Algorithm 2 Adaptive Hybrid
Input: Examples $X = \{x_i\}$ and an exploration parameter $\theta$.

Output: The set of features F and labels for all examples $x_{ij}$.

1: Query pairs of examples until we have, for each pair, found a
feature that distinguishes them, or determined that they have identical
features (by direct comparison or transitivity).

2: Maintain a queue Q of features to explore, and a queue of already
discovered features F. Initialize $Q = \{r\}$, where $r$ is a default
root feature defined as: $x_{ir} = 1$ for all $x_i \in X$. Initialize
$F = \{\}$.

3: while Queue Q is not empty do

4: Pop a feature $f$ from Q. Set off($f$) = $\{f_j$ s.t. $\nexists f'$
with $f_j < f' < f\}$. Represent each feature $f_j$ in off($f$) by a
randomly selected example $x_j$ such that $x_{j f_j} = 1$.

5: Uniformly randomly select distinct examples $x, y, z \in$ off($f$),
and query $\{x, y, z\}$. If the query returns a feature $f'$, push $f'$
to Q, run labeling queries $\{x, f'\}$ for all $x \in$ off($f$) and
update off($f$).

6: If Step 5 returns $\theta$ consecutive NONEs, then add $f$ to F and
go to Step 4 and pop the next feature from Q.

7: end while

8: return F and the labels $x_{ij}$.


finds at least one feature with probability $\ge 1 - \delta$ using
$3D^2 \log \frac{1}{\delta}$ queries.

Due to space limitations, the proofs are deferred 
to the appendix. 

5 Independent features 

In this section we consider examples drawn from a distribution in which
different features are independent. Consider a statistical model in
which there is a product distribution D over a large set of examples X.
This model is used to represent features that are independent of one
another. An example of two independent features in the Faces data set
might be “Smiling” and “Wearing Glasses.” We assume that D is a product
distribution over M independent features. Thus D can be described by a
vector $(p_f)_{f \in F}$ where for any feature $f \in F$, $p_f =
\Pr_{x \sim D}[f(x) = 1]$. We also abuse notation and write $p_i$ for
$p_{f_i}$. We assume $0 < p_i < 1$.

In this model, there is a concern about how much data is required to
recover all the features. In fact, for certain features there might not
even be any triples among the data which elicit them. To see this,
consider a homogeneous crowd that always answers queries according to a
fixed order on features. Formally, if more than one feature
distinguishes a triple, suppose the feature that is given is always the
distinguishing feature $f_i$ of smallest index $i$. Intuitively, this
models a situation where features are represented in decreasing
salience, i.e., differences in the first feature (like gender) are
significantly more salient than any other feature, differences in the
second feature stand out more than any feature other than the first,
and so forth. Now, also suppose that all features have probability 1/2
of being positive.

Lemma 5.1. If $p_1 = p_2 = \cdots = p_M = 1/2$, then with a homogeneous
crowd, $N \ge 1.1^M$ examples are required to find all features with
probability 1/2 even if all triples are queried.

Proof. Since $p_i = 1/2$, the probability of any feature distinguishing
a triple is 3/8. Therefore, a homogeneous crowd will only output the
last, least salient feature if it is the only distinguishing feature,
which happens with exponentially small probability
$(3/8)(5/8)^{M-1}$ for a random triple. Given $N \le 1.1^M$ examples,
there are $\le 1.1^{3M}$ triples. By the union bound, the probability
that any of them elicits the last feature is less than
$(3/8)(5/8)^{M-1} \cdot 1.1^{3M} < 1/2$. □
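The final inequality in this proof is easy to check numerically. The sketch below simply evaluates the union-bound expression for a range of M; the function name is ours and the computation uses ordinary floating point.

```python
def lemma_5_1_union_bound(M):
    """Union bound on eliciting the least salient feature with N <= 1.1**M."""
    p_last_only = (3 / 8) * (5 / 8) ** (M - 1)  # last feature is the only one
    triples_bound = 1.1 ** (3 * M)              # at most N**3 triples
    return p_last_only * triples_bound


for M in (1, 5, 10, 20):
    print(M, lemma_5_1_union_bound(M))
```

Because $(5/8) \cdot 1.1^3 \approx 0.83 < 1$, the bound shrinks geometrically in M, and it is already just under 1/2 at M = 1.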

On the other hand, we show that all features will be discovered with a
finite number of samples. In particular, say a feature $f$ is
identifiable on a data set if there exists a triple such that $f$ is
the unique distinguishing feature. If it is identifiable, then of
course the adaptive triple algorithm will eventually identify it. We
now argue that, given sufficiently many examples, all features will be
identifiable with high probability.

Lemma 5.2 (Identifiability in the Independent Features Model). Suppose
N examples are drawn iid from the Independent Features Model where
feature $f$ has frequency $p_f$. For any feature $f$, let:

$$\tau_f = 3p_f^2(1 - p_f) \prod_{g \ne f} \left(1 - 3p_g^2(1 - p_g)\right).$$

Moreover, let $\tau_{\min} = \min_f \tau_f$. If $N \ge
\Omega(\log(1/\tau_{\min})/\tau_{\min})$, then, with constant
probability, all features are identifiable by triple queries.
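To get a feel for the quantity in Lemma 5.2, the following sketch computes $\tau_f$ for a given vector of feature frequencies. The function names are ours, and `sample_size_scale` only evaluates the expression inside the $\Omega(\cdot)$; the hidden constant is not specified here.

```python
from math import log, prod


def tau(p, f):
    """Probability that feature f is the unique distinguishing feature of a
    random triple, for independent feature frequencies p[0..M-1]."""
    q = [3 * pg ** 2 * (1 - pg) for pg in p]
    return q[f] * prod(1 - q[g] for g in range(len(p)) if g != f)


def sample_size_scale(p):
    """log(1/tau_min) / tau_min, the expression inside the Omega of Lemma 5.2."""
    tau_min = min(tau(p, f) for f in range(len(p)))
    return log(1 / tau_min) / tau_min


p_uniform = [0.5] * 3
print(tau(p_uniform, 0))  # 0.146484375
print(sample_size_scale(p_uniform))
```

For uniform frequencies $p_f = 1/2$ the value reduces to $(3/8)(5/8)^{M-1}$, the same expression that appears in the proof of Lemma 5.1.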

The above exponential upper and lower bounds are worst case. In fact,
it is not difficult to see that for a totally heterogeneous crowd,
which outputs a random distinguishing feature, if all $p_i = 1/2$, only
$N = O(\log M)$ examples would suffice to discover all features because
one could query multiple different people about each triple until one
discovered all distinguishing features. Of course, in reality one would
not expect a crowd to be completely homogeneous nor completely
heterogeneous (nor completely generalists nor completely specifists),
and one would not expect features to be completely independent or
completely hierarchical. Instead, we hope that our analysis of certain
natural cases helps shed light on why and when adaptivity can
significantly help.

As we now turn to the analysis of adaptivity and 
the number of queries, we make a “big data” 
assumption that we have an unbounded supply 
of examples. This makes the analysis simple 
in that the distribution over unresolved triples 
takes a nice form. We show that the number 
of queries required by the adaptive algorithm is 
linear in the number of features, while it grows 
exponentially with the number of features for 
any non-adaptive algorithm. 


We first provide an upper bound on the number 
of queries of the Adaptive Triple algorithm in 
this model. 

Lemma 5.3 (Adaptive Triple). Suppose for $j = 1, \ldots, k$, we have
$M_j$ independent features with frequency $p_j$ and infinitely many
examples. Then the expected number of queries used by Adaptive Triple
to discover all the features is at most $\sum_{j=1}^{k} M_j / q_j$,
where $q_j = 3p_j^2(1 - p_j)$. For the Adaptive Pair algorithm, set
$q_j = 2p_j(1 - p_j)$.

We next provide lower bounds on the number 
of queries of any non-adaptive algorithm under 
the independent feature model. 

Lemma 5.4 (non-adaptive triple). Suppose for $j = 1, \ldots, k$, we
have $M_j$ independent features with frequency $p_j$ and infinitely
many examples. Let $q_j = 3p_j^2(1 - p_j)$. The expected number of
queries made by any non-adaptive triple algorithm is at least:

$$\frac{1 - q_{\max}}{\prod_{j=1}^{k} (1 - q_j)^{M_j}},$$

where $q_{\max} = \max_j q_j$.

To interpret these results, consider the simple setting where all the
features have the same probability: $p_j = p$. Then the random triple
algorithm requires at least $1/(1-q)^{M-1}$ queries on average to find
all the features. This is exponential in the number of features, M. In
contrast, the adaptive algorithm requires at most $M/q$ queries on
average to find all the features, which is only linear in the number of
features.
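The two bounds can be tabulated for the uniform case $p = 1/2$, where $q = 3/8$. This is a sketch that evaluates the expressions from Lemmas 5.3 and 5.4 in the equal-probability setting; the function names are ours.

```python
def q_triple(p):
    """Probability that a feature with frequency p distinguishes a random triple."""
    return 3 * p ** 2 * (1 - p)


def adaptive_upper_bound(M, p):
    """M / q: upper bound on expected queries for Adaptive Triple."""
    return M / q_triple(p)


def nonadaptive_lower_bound(M, p):
    """1 / (1 - q)^(M - 1): lower bound for any non-adaptive triple algorithm."""
    return 1 / (1 - q_triple(p)) ** (M - 1)


for M in (5, 10, 20, 40):
    print(M, round(adaptive_upper_bound(M, 0.5), 1),
          round(nonadaptive_lower_bound(M, 0.5), 1))
```

With $p = 1/2$ the adaptive bound is $8M/3$, which is below $3M$, while the non-adaptive bound equals $1.6^{M-1}$; the gap becomes visible once M is moderately large.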

6 Other types of queries 

Clearly 2/3 queries are not the only type of queries. For example, an
alternative approach would use 1/3 queries in which one seeks a feature
that distinguishes one of the examples from the other two. Such queries
could result in features that are very specific to one image and fail
to elicit higher-level salient features. Under the hierarchical feature
model, 1/3 queries alone are not guaranteed to discover all the
features.



A natural generalization of the left-vs-right queries in previous work
[?] is the class of queries with sets L and R of sizes $|L| \le \ell$
and $|R| \le r$, where a valid answer is a feature common to all
examples in L that is in no examples in R. We refer to such a query as
an $\ell$-$r$ query $L - R$. In fact, a 2/3 query on $\{x, y, z\}$ may
be simulated by running the three L-R queries $\{x, y\} - \{z\}$,
$\{y, z\} - \{x\}$, and $\{x, z\} - \{y\}$. (Note that this may result
in a tripling of cost, which is significant in many applications.)
There exist data sets for which L-R queries can elicit all features
(for various values of $\ell$, $r$) while 2/3 queries may fail.

Proposition 6.1. For any $\ell, r \ge 1$, there exists a data set X of
size $N = |X| = \ell + r$ and a feature set F of size $M = |F| = 1 +
\ell + r$ such that $\ell$-$r$ queries can completely recover all
features, while no $\ell'$-$r'$ query can guarantee the recovery of the
first feature if $\ell' < \ell$ or if $r' < r$.

Proof. Let the examples be $X = L \cup R$ where $L = \{x_1, x_2,
\ldots, x_\ell\}$ and $R = \{x'_1, \ldots, x'_r\}$. Let $F = \{f\} \cup
G \cup H$ where the feature $f$ satisfies $f(x) = 1$ if $x \in L$ and
$f(x) = 0$ if $x \in R$. Define the features $g_1, g_2, \ldots, g_\ell
\in G$ to be $g_i(x) = 1$ for all $x \in L \setminus \{x_i\}$ and
$g_i(x) = 0$, otherwise. Define $H = \{h_1, \ldots, h_r\}$ where
$h_j(x) = 0$ for all $x \in R \setminus \{x'_j\}$ and $h_j(x) = 1$,
otherwise. It is clear that the query $L - R$ necessarily recovers $f$,
the query $\emptyset - \{x_i\}$ recovers $g_i$, and the query
$\{x'_j\} - \emptyset$ recovers $h_j$. Moreover, for any query $L' -
R'$ with $x_i \notin L'$, it is clear that $g_i$ is as good an answer
as $f$. Conversely, if $x'_j \notin R'$, then clearly $h_j$ is as good
an answer as $f$. Hence, if the feature $f$ is “least salient” in that
other features are always returned if possible, no such $\ell'$-$r'$
query will recover $f$. □
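The construction in this proof can be checked mechanically on a small instance. The sketch below builds the data set for $\ell = 3$, $r = 2$ and enumerates the valid answers to L-R queries; the example names (`x1`, `y1`, ...) and helper functions are ours, with `y` standing in for the primed examples of R.

```python
from itertools import combinations


def build_instance(ell, r):
    """Data set from the proof: feature name -> set of examples where it is 1."""
    L = [f"x{i}" for i in range(1, ell + 1)]
    R = [f"y{j}" for j in range(1, r + 1)]   # y_j plays the role of x'_j
    features = {"f": set(L)}
    for x in L:
        features[f"g_{x}"] = set(L) - {x}    # 1 on L except x, 0 elsewhere
    for y in R:
        features[f"h_{y}"] = set(L) | {y}    # 0 on R except y, 1 elsewhere
    return L, R, features


def valid_answers(features, left, right):
    """Features common to all of `left` and present in none of `right`."""
    left, right = set(left), set(right)
    return {name for name, members in features.items()
            if left <= members and not (right & members)}


def subsets(items):
    return [set(c) for k in range(len(items) + 1)
            for c in combinations(items, k)]


L, R, features = build_instance(3, 2)
full_query = valid_answers(features, L, R)
smaller_queries_ambiguous = all(
    len(valid_answers(features, Ls, Rs)) > 1
    for Ls in subsets(L) for Rs in subsets(R)
    if (Ls, Rs) != (set(L), set(R))
)
print(full_query, smaller_queries_ambiguous)  # {'f'} True
```

Only the full query $L - R$ has $f$ as its sole valid answer; every query on proper subsets of L and R admits at least one other feature, so a crowd that treats $f$ as least salient would never return it.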

7 Experiments 

We tested our algorithm on three datasets: 1) a set of 100 silent video
snips of a sign-language speaker [1]; 2) a set of 100 human face images
used in a previous study [?]; 3) a set of 100 images of ties, tiles and
flags from that same study [?]. All the images and videos were
initially unlabeled. The goal was to automatically elicit features that
are relevant for each dataset and to label all the items with these
features. We implemented our Adaptive Triple algorithm on the popular
crowdsourcing platform, Amazon Mechanical Turk, using two types of
crowdsourcing tasks. In a feature elicitation task, a worker is shown
three examples and is asked to specify a feature that is common to two
of the examples but is not present in the third. In a labeling task, a
worker is shown one feature and all examples and is asked which
examples have the feature. To reduce noise, we assigned each labeling
task to five different workers, assigning each label by majority.

To compare adaptivity to non-adaptivity, we implemented a Random Triple
algorithm that picks a set of random triples and then queries them all.
To compare triples to pairs, we also implemented an Adaptive Pair
algorithm, defined analogously to the Adaptive Triple algorithm except
that it only performs pair queries.

The Adaptive Triple algorithm automatically
determines which sets of examples to elicit
features from and which combination of example
and feature to label. Figure 3 shows the
first five queries of the Adaptive Triple
algorithm from one representative run on the three
datasets. For example, on the face data, after
having learned the broad gender features
male and female early on, the algorithm then
chooses all three female faces or all three male
faces to avoid duplicating the gender features
and to learn additional features.
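
The selection behavior in this example can be illustrated with a small
sketch. It assumes, purely for illustration, that a triple counts as
already covered when some known feature is present in exactly two of its
three examples; the precise selection rule is the one given in the
algorithm's definition, and the names below are hypothetical.

```python
from itertools import combinations


def is_covered(triple, known_features):
    """True if some known feature holds for exactly two of the three examples.

    known_features is a list of dicts mapping example id -> 0/1 label
    (an assumed layout for this sketch).
    """
    return any(sum(feature[x] for x in triple) == 2 for feature in known_features)


def next_uncovered_triple(examples, known_features):
    """Return the first triple that no known feature splits two-versus-one."""
    for triple in combinations(examples, 3):
        if not is_covered(triple, known_features):
            return triple
    return None  # every triple is already covered by a known feature
```

Once both "male" and "female" are known, any triple containing two faces
of one gender and one of the other is covered, so the only triples left
to query are all-male or all-female, as in the run described above.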

We compared the Adaptive Triple algorithm
to several natural baselines: 1) a non-adaptive
triple algorithm that randomly selects sets of
three examples to query; 2) the Adaptive Pairs
algorithm; 3) the standard tagging approach
where the worker is shown one example to tag
at a time and is asked to return a feature that
is relevant for the example. We used two
complementary metrics to evaluate the performance
of these four algorithms: the number of
interesting and distinct features the algorithm
discovers, and how efficiently the discovered
features partition the dataset.

Figure 3: The first five features obtained from a representative run of the Adaptive Triple algorithm
on the signs (left), faces (middle) and products (right) datasets. Each triple of images is shown in
a row beside the proposed feature, and the two examples declared to have that feature are shown
on the left, while the remaining example is shown on the right.


                  signs        faces        products
adaptive triple   24.5 (3.8)   25.3 (0.3)   19 (1.4)
random triples    12.5 (0.4)   18.7 (2.7)   14 (1.4)
adaptive pairs    11.5 (1.1)   14.5 (1.8)   10.5 (0.4)
tagging           9 (0.4)      13 (0.71)    12 (0.4)

Table 1: Number of interesting and distinct
features discovered. Standard error shown in
parentheses.

In many settings, we would like to generate as
many distinct, relevant features as possible. On
a given data set, we measure the distance
between two features by the fraction of examples
that they disagree on (i.e., the Hamming
distance divided by the number of examples). We
say that a feature is interesting if it differs from
the all-0 feature (a feature that is not present in
any image) and from the all-1 feature (a feature
that is ubiquitous in all images) in at least 10%
of the examples. A feature is distinct if it differs
in at least 10% of the examples from every other
feature. If multiple features are redundant, we
represent them by the feature that was
discovered first.
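
A minimal sketch of this filtering step is given below. It assumes each
feature is a list of 0/1 labels over the same examples, listed in order
of discovery, and that redundancy is checked against the features already
kept; the function names are hypothetical.

```python
def distance(f, g):
    """Fraction of examples on which two binary label vectors disagree."""
    return sum(a != b for a, b in zip(f, g)) / len(f)


def interesting_and_distinct(features, threshold=0.1):
    """Keep features that are interesting and distinct, in discovery order.

    A feature is dropped if it is within `threshold` of the all-0 or all-1
    feature, or within `threshold` of a feature that was kept earlier.
    """
    kept = []
    for f in features:
        n = len(f)
        if distance(f, [0] * n) < threshold or distance(f, [1] * n) < threshold:
            continue  # not interesting: almost never or almost always present
        if any(distance(f, g) < threshold for g in kept):
            continue  # redundant with an earlier feature, which represents it
        kept.append(f)
    return kept
```

The count reported for a run is then the length of the returned list.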

Table 1 shows the number of interesting and
distinct features discovered by the four algorithms.
On each dataset, we terminate the algorithm
after 35 feature elicitation queries. Each
experiment was done in two independent replicates
(different random seeds and Mechanical Turk
sessions). The Adaptive Triple algorithm
discovered substantially more features than all other
approaches in all three datasets. The
non-adaptive approaches (random triples and
tagging) were hampered by repeated discoveries
of a few obvious features: one/two-handed
motions in signs, male/female in faces and
product categories in products. Once Adaptive
Triples learned these obvious features, it
purposely chose sets of examples that cannot be
distinguished by the obvious features in order
to learn additional features. Adaptive
comparison of pairs of examples performed poorly not
because of redundant features but because after
it learned a few good features, all pairs of
examples could be distinguished and the algorithm
ran out of useful queries to make. This is in
agreement with our analysis of hierarchical
features: pairwise comparisons are only guaranteed
to find the base-level features of the
hierarchy, while triples can provably find all the
features.

To evaluate how efficiently the discovered
features can partition the dataset, we compute the
average size of the partitions induced by the
first $k$ discovered features. More precisely, let
$f_t$ be the $t$-th discovered feature. Then
features $f_1, \ldots, f_k$ induce a partition on the
examples, $P_1, \ldots, P_R$, such that examples $x_i, x_j$
belong to the same part if they agree on
all the features $f_1, \ldots, f_k$. The average fraction
of indistinguishable images is

$$g(\{f_1, \ldots, f_k\}) = \sum_{r=1}^{R} \frac{|P_r|^2}{N^2}.$$

Before any features were discovered,
$g = 1$. If the features perfectly distinguish
every image, then $g = 1/N$.

Figure 4: Comparisons of the adaptive triple
algorithm with benchmarks.
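
The metric can be computed directly from the labels. The sketch below
assumes that the average fraction of indistinguishable images is
g = sum_r |P_r|^2 / N^2, a form chosen because it matches the two boundary
values stated above (g = 1 with no features, g = 1/N when every image is
separated). The function name and input layout are hypothetical.

```python
from collections import Counter


def avg_indistinguishable_fraction(signatures):
    """Compute g = sum_r |P_r|^2 / N^2.

    signatures[i] is the tuple of labels of example i on the first k
    discovered features; examples with equal tuples share a part.
    """
    n = len(signatures)
    part_sizes = Counter(tuple(s) for s in signatures).values()
    return sum(size * size for size in part_sizes) / (n * n)
```

For instance, one feature that splits four examples evenly gives two
parts of size two, so g = (4 + 4) / 16 = 0.5.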

In Figure 4 we plot the value of $g$ for the
adaptive triple algorithm and the benchmarks as a
function of the number of queries. The adaptive
algorithms require significantly fewer queries
to scatter the images compared to the
non-adaptive algorithms. On the sign data set, for
example, the adaptive triple algorithm required
13 queries to achieve $g = 0.05$ (i.e., a typical
example is indistinguishable from 5% of examples),
while random triples required 31 queries to achieve
the same $g = 0.05$. Adaptive Triples and
Adaptive Pairs both achieved a rapid decrease in
$g$, indicating that both were discovering good
discriminatory features. However, as we saw
above, Adaptive Pairs terminated early because
it no longer had any unresolved pairs of
examples to query, while Adaptive Triples continued
to discover new features.


queries. Consistent with previous work [?],
we demonstrated that tagging can be inefficient
for generating features that are diverse and
discriminatory. Our theoretical analysis suggested
that the Adaptive Triple algorithm can
efficiently discover features, and our experiments
on three data sets provided validation for the
theoretical predictions. Moreover, unlike previous
non-adaptive feature elicitation algorithms,
which had to detect redundant features (either
using humans or natural language processing),
our algorithm is designed to avoid generating
these redundant features in the first place.

A key reason that our algorithm outperformed
the non-adaptive baseline is that in all three
of our data sets there were some features that
were especially salient, namely gender for faces,
one or two hands for sign language, and
product type for products. An interesting direction
of future work would be to investigate the
performance of adaptive algorithms in other types
of data.

Our analysis suggests that homogeneous crowds
and crowds of generalists should be most
challenging for eliciting features. Modeling the
salience of features and the diversity of the
crowd are also interesting directions of future
work. In particular, our algorithm made no
explicit attempt to find the most salient features;
for example, one could imagine aggregating multiple 2/3
responses to find the most commonly mentioned
features. In addition, one could leverage the
fact that different users find different features to
be salient and model the diversity of the crowd
to extract even more features.
A Analysis of the hierarchical feature model


Proof. (Proof of Lemma 4.4) Let $(x_i, x_j, x_k)$ be any triplet of examples. Let $f_{lca}$ be the lowest
common ancestor of $x_i$, $x_j$ and $x_k$ in $T$; that is, $f_{lca}$ is the lowest feature $f$ in $T$ such that
$x_{i,f} = x_{j,f} = x_{k,f} = 1$. If $f_{lca}$ is also the lowest common ancestor of any two out of $(x_i, x_j, x_k)$, then the
query $\{x_i, x_j, x_k\}$ will return NONE; otherwise it returns a node feature.


Recall that in the Adaptive Hybrid algorithm, after the double queries in step 1, we associate each 
feature fj with a single example. Thus, for the rest of the proof we assume that there exists a 
one-to-one mapping between an example and a feature at the leaf node of T. 


Let $f$ be any feature in $T$, and let $L_f$ be the subset of triples $(x, y, z)$ such that $f$ is the lowest
common ancestor of $x$, $y$ and $z$. For any triple $(x, y, z)$, let $I(x, y, z) = 1$ if one of the triple queries
$\{x, y, z\}$ returns a feature; otherwise let $I(x, y, z) = 0$. The total number of triple queries which
will return a feature can be written as $\sum_{f \in T} \sum_{(x,y,z) \in L_f} I(x, y, z)$.

Suppose that $f$ has $k$ children in $T$. Let $n_1 \ge n_2 \ge \cdots \ge n_k$ be the number of examples associated
with these children. We have two cases.


In the first case, $n_1 \ge 2$. Let us call such a feature heavy. In this case, querying any triple $(x, y, z)$
where $x$ and $y$ are from the first child will result in a feature. The fraction of such triples in $L_f$ is
at least

$$\frac{n_1(n_1 - 1)}{[\ldots]} \ge \frac{1}{2D^2}.$$

Thus, for a heavy $f$,

$$\sum_{(x,y,z) \in L_f} I(x, y, z) \ge \frac{|L_f|}{2D^2}.$$


In the second case, $n_1 = 1$. Call such an $f$ a light feature. As $T$ is not a star, there exists at least
one leaf $l \in T$ which does not have $f$ as an ancestor. Consider triples of the form $(x, y, l)$ where $x$
and $y$ are descendants of $f$ such that $f$ is their lowest common ancestor, and let $S_{f,l}$ be the set of
all such triples.


It turns out that $S_{f,l}$ has some nice properties. First, $|S_{f,l}| \ge |L_f|/D$; this is because if $f$ has $k \le D$
children, then $|S_{f,l}| = \binom{k}{2}$ while $|L_f| = \binom{k}{3}$. Second, if $(x, y, l)$ is a triple in $S_{f,l}$, then the queries
$(x, y, l)$ will return a new feature. Finally, suppose we map each light feature $f$ to the set $S_{f,l}$; then
the sets $S_{f,l}$ and $S_{f',l}$ are disjoint when $f \ne f'$.

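Therefore,

$$\sum_{\text{light } f} |L_f| \le D \sum_{\text{light } f} |S_{f,l}| \le D \sum_{f} \sum_{(x,y,z) \in L_f} I(x, y, z).$$

Combining the two cases, we get:

$$\sum_{\text{light } f} |L_f| + \sum_{\text{heavy } f} |L_f| \le D \sum_{f} \sum_{(x,y,z) \in L_f} I(x, y, z) + 2D^2 \sum_{f} \sum_{(x,y,z) \in L_f} I(x, y, z).$$

Therefore, if we draw a random triple of examples from the subtree below $f$, and make the
corresponding three triple queries, the probability that we get a new feature is at least $1/(D + 2D^2)$. The lemma
follows. □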


Proof. (Proof of Proposition 4.3) We begin by observing that any time the queries in Step 5 of the
algorithm return a feature, it must be a new feature that we haven't seen before.




Each leaf feature $f$ is the unique solution to the double query $(x, y)$ where $x$ is under $f$ and $y$ is
under a sibling leaf feature. Thus, all the leaf features are identified by double queries. Moreover,
the double queries return at most $N$ NONE answers.


Let $f$ be the feature that we have currently popped from the queue $Q$, i.e., the feature that we are
currently exploring. Let $T_f$ be the induced subtree of $T$ with root at $f$ and leaves the set off($f$).
Note that $T_f$ is the true underlying subtree (that is, not the subtree that we have found), and it is
also $D$-ary. The Adaptive Hybrid algorithm now randomly samples triples of examples from off($f$)
to query. If $T_f$ is a star, then there are no new features to be found and this subroutine stops
after $\theta$ queries. Otherwise, when $\theta = O(D^2 \log \frac{M}{\delta})$, from Lemma 4.4, with probability at least
$1 - \delta/M$ it returns a new feature. Therefore the probability of finding all $M$ features is
at least $1 - \delta$. The algorithm terminates after $O(N + M D^2 \log \frac{M}{\delta})$ total queries. □


B Analysis of the independent feature model 


Proof. (Proof of Lemma 5.2) Let $f$ be any feature, and let $x_i$ and $x_j$ be a randomly drawn pair
of examples from $D$. The probability that $f$ satisfies the double query $(x_i, x_j)$ is $2p_f(1 - p_f)$;
moreover, the probability that $f$ is the only feature that satisfies this query is
$\lambda_f = 2p_f(1 - p_f) \prod_{g \ne f} (1 - 2p_g(1 - p_g))$.


Now consider the process of drawing $N/2$ pairs of random examples from $D$. The probability that
the $i$-th pair $(x, y)$ is such that the double query $(x, y)$ is uniquely satisfied by feature $f$ is $\lambda_f$. The
first part of the lemma follows from a coupon collector's argument. The proof of the second part
is very similar. □
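
As a numerical illustration of the quantities in this proof, the sketch
below evaluates the probability that a random pair is split by a feature,
2p(1 - p), and the probability that one feature is the only one splitting
the pair. The product form used for the latter is an assumption that
follows from treating the features as independent; the function names are
hypothetical.

```python
import math


def split_probability(p):
    """Probability that a feature of frequency p differs on a random pair."""
    return 2 * p * (1 - p)


def unique_split_probability(freqs, f):
    """Probability that feature f is the only feature splitting a random pair.

    Assumes independent features: f must split the pair and every other
    feature g must fail to split it.
    """
    others = math.prod(
        1 - split_probability(p) for g, p in enumerate(freqs) if g != f
    )
    return split_probability(freqs[f]) * others
```

With two features of frequency 0.5 each, a pair is split by a given
feature with probability 0.5 and uniquely split by it with probability
0.5 * 0.5 = 0.25.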


Proof. (Proof of Lemma 5.3) Let $f_j$ be a feature with frequency $p_j$, and let $(x_1, x_2, x_3)$ be a randomly
drawn triple of examples. The probability that $f_j$ satisfies the triple query $(x_1, x_2, x_3)$ is $q_j = 3p_j^2(1 - p_j)$.
Let $T$ be the full set of features. Suppose we have already seen the set of features $S$. Then
the probability that the next query will discover an unseen feature is at least

$$1 - \prod_{j \in T \setminus S} (1 - q_j),$$

and therefore the expected time to discover the next unseen feature is at most

$$\frac{1}{1 - \prod_{j \in T \setminus S} (1 - q_j)}.$$

This quantity is a decreasing function of each $q_j$. Thus, the worst-case order of discovering features,
the one that maximizes the expected discovery time, is from high to low values of $q_j$.

WLOG we will assume that $q_1 \ge q_2 \ge \cdots \ge q_M$. The total expected discovery time is at most

$$\sum_{i=1}^{M} \frac{1}{1 - \prod_{j=i}^{M} (1 - q_j)} \le \sum_{i=1}^{M} \frac{1}{1 - (1 - q_i)}.$$

□
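
The bound in this proof can be evaluated numerically. The sketch below
assumes that features are discovered from high to low q_j (the worst-case
order named in the proof), that the features are independent, and that
every frequency lies strictly between 0 and 1; the function names are
hypothetical.

```python
import math


def triple_hit_probability(p):
    """q = 3 p^2 (1 - p): the feature holds for exactly two of three examples."""
    return 3 * p * p * (1 - p)


def discovery_time_bounds(freqs):
    """Return (tight, loose) upper bounds on the total expected discovery time.

    tight sums 1 / (1 - prod_{j >= i} (1 - q_j)) over i, with q sorted from
    high to low; loose is the simpler bound sum_i 1 / q_i.
    """
    qs = sorted((triple_hit_probability(p) for p in freqs), reverse=True)
    tight = 0.0
    for i in range(len(qs)):
        miss_all = math.prod(1 - q for q in qs[i:])  # no unseen feature hit
        tight += 1 / (1 - miss_all)
    loose = sum(1 / q for q in qs)
    return tight, loose
```

The first value is never larger than the second, since the product over
the remaining features is at most the single factor 1 - q_i.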


Proof. (Proof of Lemma 5.4) Suppose that the features are discovered according to some order $\pi$.
Then, the probability that a random triple elicits the last feature $\pi(N)$ is:

$$\prod_{i < N} (1 - q_{\pi(i)}) \, q_{\pi(N)}.$$

Of course this is minimized when $q_{\pi(N)}$ is minimized. Although a general adaptive algorithm
can have a structure to the triples it chooses, we can use the union bound to bound the
probability that any triple elicits the last feature. In this case, each triple is essentially random. □